Interactive History Browser

Lukasz A. Bartnik

2017-12-25

Introduction

The goal of this tutorial is to introduce the interactive history browser implemented in the experiment package. It follows one of the examples accessible via experiment::simulate_london_meters and is based on the London meter data.

The history browser keeps track of all expressions evaluated in an R session. It remembers all objects and plots, and lets the user move back and forth in that recorded history.

In this short introduction, we will perform a simplified data exploration exercise, similar to what a “real” data exploration might look like. In order to keep the big picture clean, we avoid poking around too much.

We start by loading the packages we will need for our analysis. The history tracker does not record commands that do not produce new objects or plots, so it ignores this next block of code.

library(dplyr)
library(lubridate)
library(magrittr)
library(ggplot2)

Turn tracing on

Now it is time to load the experiment package and turn on its tracing capability. experiment registers a callback with addTaskCallback and uses that callback to keep a record of changes in the global environment of our R session [1].

library(experiment)
tracking_on()

Calling tracking_on() in a live R session will change the R prompt to [tracked] >. In this vignette, in order to make it easier to copy the R code, the prompt remains hidden.
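
The callback mechanism mentioned above can be illustrated with a minimal sketch. This is not the implementation used by experiment: make_tracker and its fields are hypothetical names, and the sketch only notices newly created variable names, whereas the package also records overwritten objects and plots. Only addTaskCallback and removeTaskCallback are real base R functions here.

```r
# Hypothetical sketch: build a task callback that logs the names of
# variables that appear in an environment between top-level commands.
make_tracker <- function(env = globalenv()) {
  state <- new.env()
  state$seen <- ls(env)       # names known when tracking starts
  state$log  <- character()   # names recorded so far, in order of appearance

  callback <- function(expr, value, ok, visible) {
    added <- setdiff(ls(env), state$seen)
    state$log  <- c(state$log, added)
    state$seen <- union(state$seen, added)
    TRUE  # a task callback must return TRUE to stay registered
  }
  list(callback = callback, state = state)
}

tracker <- make_tracker()
id <- addTaskCallback(tracker$callback)
# ... interactive work: R runs the callback after each top-level command ...
removeTaskCallback(id)
```

In an interactive session, tracker$state$log would then list the variables created while the callback was registered.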

Preparing the data set

Here is the first command that produces a (new) data object. It reads, transforms and filters a CSV file distributed with the experiment package.

input <-
  system.file("extdata/block_62.csv", package = "experiment") %>%
  readr::read_csv(na = 'Null') %>%
  rename(meter = LCLid, timestamp = tstp, usage = `energy_kWh`) %>%
  filter(meter %in% c("MAC004929", "MAC000010", "MAC004391"),
         year(timestamp) == 2013)

Let’s look at the data. It turns out that the observations are recorded every 30 minutes.

head(input)
#> # A tibble: 6 x 3
#>       meter           timestamp usage
#>       <chr>              <dttm> <dbl>
#> 1 MAC000010 2013-01-01 00:00:00 0.509
#> 2 MAC000010 2013-01-01 00:30:00 0.453
#> 3 MAC000010 2013-01-01 01:00:00 0.500
#> 4 MAC000010 2013-01-01 01:30:00 0.621
#> 5 MAC000010 2013-01-01 02:00:00 0.197
#> 6 MAC000010 2013-01-01 02:30:00 0.176

Let’s aggregate them and continue with hourly readings.

input %<>%
  mutate(timestamp = floor_date(timestamp, 'hours')) %>%
  group_by(meter, timestamp) %>%
  summarise(usage = sum(usage))
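
To see what this aggregation does to individual rows, here is a base R sketch of the same idea on four made-up half-hourly readings. The readings data frame and its values are invented for illustration and do not come from the London meter data; trunc() stands in for floor_date() and aggregate() for the group_by/summarise pair.

```r
# Toy half-hourly readings (made-up values, not from the London data set).
readings <- data.frame(
  timestamp = as.POSIXct(c("2013-01-01 00:00:00", "2013-01-01 00:30:00",
                           "2013-01-01 01:00:00", "2013-01-01 01:30:00"),
                         tz = "UTC"),
  usage = c(0.5, 0.25, 1, 0.5)
)

# Truncate each timestamp to the start of its hour, then sum within the hour.
readings$hour_start <- as.POSIXct(trunc(readings$timestamp, units = "hours"))
hourly <- aggregate(usage ~ hour_start, data = readings, FUN = sum)
hourly$usage  # 0.75 and 1.5: each pair of half-hour readings collapses into one row
```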

The first meter

We have three meters in the data set: MAC000010, MAC004391, and MAC004929. We will look at them one by one, starting with MAC004929.

input %<>% filter(meter == "MAC004929")

Just a glimpse of the full data set before we look at aggregations.

with(input, plot(timestamp, usage, type = 'p', pch = '.'))

All right! That doesn’t reveal much. How about breaking the data set down by hour and day of week? Any patterns here? We start by aggregating the input set into a temporary variable x.

x <-
  input %>%
  mutate(hour = hour(timestamp),
         dow  = wday(timestamp, label = TRUE)) %>%
  mutate_at(vars(hour, dow), funs(as.factor)) %>%
  group_by(hour, dow) %>%
  summarise(usage = mean(usage, na.rm = TRUE))

And now we can take a look at the by-hour plot:

with(x, plot(hour, usage))

And the hour-by-day-of-the-week breakdown:

ggplot(x) + geom_point(aes(x = hour, y = usage)) + facet_wrap(~dow)

So these are mean values. How about the distribution around the mean? We can visualize that with a boxplot. We start by overwriting the x variable and then produce a new plot.

x <-
  input %>%
  mutate(hour = hour(timestamp),
         dow  = wday(timestamp)) %>%
  mutate_at(vars(hour, dow), funs(as.factor))
ggplot(x) + geom_boxplot(aes(x = hour, y = usage)) + facet_wrap(~dow)

OK! Let’s look at a linear model for this data.

m <- lm(usage ~ hour:dow, x)
summary(m)
#> 
#> Call:
#> lm(formula = usage ~ hour:dow, data = x)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -1.04183 -0.19047 -0.03992  0.08349  3.09831 
#> 
#> Coefficients: (1 not defined because of singularities)
#>              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)  0.761096   0.050023  15.215  < 2e-16 ***
#> hour0:dow1  -0.124288   0.070744  -1.757 0.078973 .  
#> hour1:dow1  -0.270596   0.070744  -3.825 0.000132 ***
#> hour2:dow1  -0.478827   0.070744  -6.768 1.39e-11 ***
...
#> hour22:dow7 -0.007462   0.070744  -0.105 0.916003    
#> hour23:dow7        NA         NA      NA       NA    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.3607 on 8592 degrees of freedom
#> Multiple R-squared:  0.3471, Adjusted R-squared:  0.3344 
#> F-statistic: 27.35 on 167 and 8592 DF,  p-value: < 2.2e-16
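
The note "1 not defined because of singularities" in the summary follows from the model formula. With hour:dow and no main effects, R creates one indicator column for each of the 24 x 7 = 168 hour/day cells; together with the intercept that makes 169 columns, but they span only 168 dimensions, so the last coefficient is reported as NA. The sketch below reproduces this on a made-up 2 x 2 design (toy, a, b and y are illustrative names, and the response is random noise).

```r
# Toy 2 x 2 design with three replicates per cell; y is random noise.
set.seed(1)
toy <- expand.grid(a = factor(c("x", "y")), b = factor(c("u", "v")), rep = 1:3)
toy$y <- rnorm(nrow(toy))

# Intercept plus one indicator per cell: 5 columns, but only rank 4.
fit <- lm(y ~ a:b, data = toy)
names(coef(fit))
sum(is.na(coef(fit)))  # 1: one cell indicator is aliased with the intercept
```

Writing the formula as y ~ 0 + a:b drops the intercept, in which case every cell gets its own estimable coefficient.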

At this point we might decide we know enough. (We probably don’t yet, but for the sake of the presentation, let’s assume we actually do. After all, this is an introduction to the history browser, not to time series analysis.)

History recorded so far

So what does the history look like so far? We can open an interactive viewer by plotting the history object returned by fullhistory(). It is an htmlwidget, so when you run this in an actual R session in RStudio, it will open in the viewer pane.

plot(fullhistory())

Each node represents either an object introduced at some point in time to the R session, or a plot. Objects have their names displayed inside the node; plots are shown as thumbnails.

You can hover your mouse cursor over each node in the history and see the expression that produced the given object along with its general characteristics, like dimensions of a data.frame or the AIC value for a linear model.
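
As a rough idea of how such per-object characteristics could be computed, the following sketch dispatches on the type of the object. The describe function is a hypothetical helper written for this illustration, not a function exported by experiment, and the built-in cars data set stands in for the meter data.

```r
# Hypothetical helper: a one-line summary of an object, similar in spirit
# to what the history widget shows on hover.
describe <- function(obj) {
  if (is.data.frame(obj)) {
    sprintf("data.frame [%d x %d]", nrow(obj), ncol(obj))
  } else if (inherits(obj, "lm")) {
    sprintf("lm, AIC = %.1f", AIC(obj))
  } else {
    paste(class(obj), collapse = "/")
  }
}

describe(cars)                           # "data.frame [50 x 2]"
describe(lm(dist ~ speed, data = cars))  # "lm, AIC = ..." with the model's AIC
describe(1:3)                            # "integer"
```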

Second meter

Let’s go back to the last step before narrowing down to just one meter. Clicking on the second input node in the history window copies this object’s identifier, which we can then use to bring back the state of the R session just after it was created.

restore('40692a8b662cc0327e12076193c13a8b730d2fc0')

Now we can try a different house.

input %<>% filter(meter == "MAC000010")

We aggregate the data with the same query as before and look at the boxplot. Anything interesting here?

x <-
  input %>%
  mutate(hour = hour(timestamp),
         dow  = wday(timestamp)) %>%
  mutate_at(vars(hour, dow), funs(as.factor))
ggplot(x) + geom_boxplot(aes(x = hour, y = usage)) + facet_wrap(~dow)

History, again

The history looks different now, as there is a second branch reflecting the last three commands we have just issued.

plot(fullhistory())
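
The branching can be pictured as a tree in which every recorded state points to its parent, and restoring an earlier state simply moves the point from which the next state will grow. The sketch below is a toy model under that assumption: new_history, commit, restore_to and children are hypothetical names, and the sequential identifiers c1, c2, ... stand in for the 40-character identifiers used by the package.

```r
# Toy model of a branching history; not the package's data structure.
new_history <- function() {
  h <- new.env()
  h$commits <- list()        # id -> list(parent = <id or NA>, objects = <list>)
  h$head <- NA_character_    # id of the state the session is currently at
  h
}

commit <- function(h, objects) {
  id <- sprintf("c%d", length(h$commits) + 1L)  # stand-in for a real identifier
  h$commits[[id]] <- list(parent = h$head, objects = objects)
  h$head <- id
  id
}

restore_to <- function(h, id) {
  stopifnot(id %in% names(h$commits))
  h$head <- id               # later commits will branch off from here
  h$commits[[id]]$objects
}

children <- function(h, id) {
  parents <- vapply(h$commits, function(cm) cm$parent, character(1))
  names(parents)[!is.na(parents) & parents == id]
}

h <- new_history()
base   <- commit(h, list(input = "all meters"))
first  <- commit(h, list(input = "MAC004929"))
restore_to(h, base)
second <- commit(h, list(input = "MAC000010"))
children(h, base)  # "c2" "c3": two branches start at the same state
```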

Third house

OK, so how about the third house in the data set? We restore the same point in time again, and repeat the same sequence of commands.

restore('40692a8b662cc0327e12076193c13a8b730d2fc0')
input %<>% filter(meter == "MAC004391")
x <-
  input %>%
  mutate(hour = hour(timestamp),
         dow  = wday(timestamp)) %>%
  mutate_at(vars(hour, dow), funs(as.factor))
ggplot(x) + geom_boxplot(aes(x = hour, y = usage)) + facet_wrap(~dow)

As we can see, the history gets updated again to reflect the third branching on the third house in the data set.

plot(fullhistory())

Searching the history

Our last step will be reducing the size of the history graph presented in the widget. We do it with the query_by() function. Let’s start with finding all variables named input.

h <- query_by(is_named('input'))

Looking at the history graph reveals that it is now much smaller.

plot(h)

How about finding only data frames?

h <- query_by(inherits('data.frame'))
plot(h)

And finally we ask to see only the plots.

h <- query_by(inherits('plot'))
plot(h)
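
Conceptually, each of these queries keeps only the history entries that satisfy a predicate. The sketch below mimics that with base R's Filter on a hand-made list of entries; query_sketch, is_named_sketch and inherits_sketch are hypothetical stand-ins and say nothing about how query_by() is implemented in the package.

```r
# Predicate factories: each returns a function that tests one history entry.
is_named_sketch <- function(name) function(entry) identical(entry$name, name)
inherits_sketch <- function(cls)  function(entry) inherits(entry$object, cls)

# Keep only the entries for which the predicate returns TRUE.
query_sketch <- function(entries, predicate) Filter(predicate, entries)

# A hand-made "history" built from the cars data set shipped with R.
entries <- list(
  list(name = "input", object = cars),
  list(name = "m",     object = lm(dist ~ speed, data = cars)),
  list(name = "input", object = head(cars))
)

length(query_sketch(entries, is_named_sketch("input")))       # 2
length(query_sketch(entries, inherits_sketch("data.frame")))  # 2
length(query_sketch(entries, inherits_sketch("lm")))          # 1
```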

Re-building the vignette

In case you have problems when rebuilding this vignette, here is what my current R session is like:

library(devtools)
devtools::session_info()
#>  setting  value                       
#>  version  R version 3.4.3 (2017-11-30)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language en_US                       
#>  collate  en_US.UTF-8                 
#>  tz       America/Los_Angeles         
#>  date     2017-12-25                  
#> 
#>  package     * version    date       source        
#>  assertthat    0.2.0      2017-04-11 CRAN (R 3.4.0)
#>  backports     1.1.1      2017-09-25 CRAN (R 3.4.2)
#>  base        * 3.4.3      2017-12-01 local         
#>  bindr         0.1        2016-11-13 CRAN (R 3.4.2)
#>  bindrcpp    * 0.2        2017-06-17 CRAN (R 3.4.2)
#>  broom         0.4.2      2017-02-13 CRAN (R 3.4.2)
#>  clisymbols    1.2.0      2017-05-21 CRAN (R 3.4.2)
#>  colorspace    1.3-2      2016-12-14 CRAN (R 3.4.1)
#>  compiler      3.4.3      2017-12-01 local         
#>  crayon        1.3.4      2017-09-16 CRAN (R 3.4.2)
#>  datasets    * 3.4.3      2017-12-01 local         
#>  defer         0.3.0      2017-12-26 local         
#>  devtools    * 1.13.4     2017-11-09 CRAN (R 3.4.2)
#>  digest        0.6.12     2017-01-27 CRAN (R 3.4.0)
#>  dplyr       * 0.7.4      2017-09-28 CRAN (R 3.4.2)
#>  evaluate    * 0.10.1     2017-06-24 CRAN (R 3.4.2)
#>  experiment  * 0.1        2017-12-26 local         
#>  foreign       0.8-69     2017-06-21 CRAN (R 3.4.2)
#>  formatR       1.5        2017-04-25 CRAN (R 3.4.0)
#>  ggplot2     * 2.2.1      2016-12-30 CRAN (R 3.4.1)
#>  glue          1.1.1      2017-06-21 CRAN (R 3.4.2)
#>  graphics    * 3.4.3      2017-12-01 local         
#>  grDevices   * 3.4.3      2017-12-01 local         
#>  grid          3.4.3      2017-12-01 local         
#>  gtable        0.2.0      2016-02-26 CRAN (R 3.4.1)
#>  hms           0.3        2016-11-22 CRAN (R 3.4.0)
#>  htmltools     0.3.6      2017-04-28 CRAN (R 3.4.0)
#>  htmlwidgets   0.9        2017-07-10 cran (@0.9)   
#>  jsonlite      1.5        2017-06-01 CRAN (R 3.4.2)
#>  knitr       * 1.17       2017-08-10 CRAN (R 3.4.2)
#>  labeling      0.3        2014-08-23 CRAN (R 3.4.1)
#>  lattice       0.20-35    2017-03-25 CRAN (R 3.4.2)
#>  lazyeval      0.2.0      2016-06-12 CRAN (R 3.4.0)
#>  lubridate   * 1.6.0      2016-09-13 CRAN (R 3.4.0)
#>  magrittr    * 1.5        2014-11-22 CRAN (R 3.4.0)
#>  memoise       1.1.0      2017-04-21 CRAN (R 3.4.0)
#>  methods     * 3.4.3      2017-12-01 local         
#>  mnormt        1.5-5      2016-10-15 CRAN (R 3.4.2)
#>  munsell       0.4.3      2016-02-13 CRAN (R 3.4.1)
#>  nlme          3.1-131    2017-02-06 CRAN (R 3.4.2)
#>  parallel      3.4.3      2017-12-01 local         
#>  pkgconfig     2.0.1      2017-03-21 CRAN (R 3.4.2)
#>  plyr          1.8.4      2016-06-08 CRAN (R 3.4.0)
#>  psych         1.7.8      2017-09-09 CRAN (R 3.4.2)
#>  purrr         0.2.4      2017-10-18 CRAN (R 3.4.2)
#>  R6            2.2.2      2017-06-17 CRAN (R 3.4.2)
#>  Rcpp          0.12.13    2017-09-28 CRAN (R 3.4.2)
#>  readr         1.1.1      2017-05-16 CRAN (R 3.4.0)
#>  reshape2      1.4.2      2016-10-22 CRAN (R 3.4.1)
#>  rlang         0.1.4      2017-11-05 cran (@0.1.4) 
#>  rmarkdown     1.6        2017-06-15 CRAN (R 3.4.2)
#>  rprojroot     1.2        2017-01-16 CRAN (R 3.4.0)
#>  rsvg          1.1        2017-03-21 CRAN (R 3.4.3)
#>  scales        0.5.0      2017-08-24 CRAN (R 3.4.1)
#>  stats       * 3.4.3      2017-12-01 local         
#>  storage       0.1.0      2017-12-18 local         
#>  stringi       1.1.5      2017-04-07 CRAN (R 3.4.0)
#>  stringr     * 1.2.0      2017-02-18 CRAN (R 3.4.0)
#>  testthat    * 1.0.2.9000 2017-10-22 local         
#>  tibble        1.3.4      2017-08-22 CRAN (R 3.4.2)
#>  tidyr         0.7.2      2017-10-16 CRAN (R 3.4.2)
#>  tools         3.4.3      2017-12-01 local         
#>  utils       * 3.4.3      2017-12-01 local         
#>  withr         2.0.0      2017-07-28 CRAN (R 3.4.2)
#>  yaml          2.1.14     2016-11-12 CRAN (R 3.4.0)

  1. Actually, a number of hacks were needed to prepare this vignette. Thus, although you can simply copy and run the code line by line in your R session, the actual source code of the vignette will reveal much more than that.